# Lightweight Multimodal
HyperCLOVAX-SEED-Vision-Instruct-3B
Other
HyperCLOVAX-SEED-Vision-Instruct-3B is a lightweight multimodal model developed by NAVER, featuring image-text understanding and text generation capabilities, with special optimization for Korean language processing.
Image-to-Text
Transformers

naver-hyperclovax · Downloads: 160.75k · Likes: 170
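A minimal loading sketch, assuming the standard transformers remote-code pattern; the vision preprocessing and chat API for HyperCLOVAX are supplied by the repo's custom code, so consult the model card before use:

```python
from transformers import AutoModelForCausalLM, AutoProcessor

# trust_remote_code loads the repo's custom multimodal implementation.
# Whether AutoProcessor covers the vision inputs is an assumption here.
repo = "naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B"
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
```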
Barcenas-4b
A multimodal model fine-tuned from google/gemma-3-4b-it, specializing in high-quality data for mathematics, programming, science, and puzzle-solving.
Image-to-Text
Transformers English

Danielbrdz · Downloads: 15 · Likes: 2
Heron-NVILA-Lite-1B
Apache-2.0
A Japanese vision-language model built on the NVILA-Lite architecture, supporting image-text interaction in both Japanese and English.
Image-to-Text
Safetensors Supports Multiple Languages
turing-motors · Downloads: 460 · Likes: 2
SmolVLM2-256M-Video-Instruct-mlx
Apache-2.0
A video-text-to-text model converted to the MLX framework, suited to video understanding and instruction-following tasks.
Image-to-Text
Transformers English

mlx-community · Downloads: 591 · Likes: 7
SmolVLM2-500M-Video-Instruct
Apache-2.0
A lightweight multimodal model designed for analyzing video content, capable of processing video, image, and text inputs to generate text outputs.
Image-to-Text
Transformers English

HuggingFaceTB · Downloads: 17.89k · Likes: 56
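A minimal usage sketch following the pattern shown on the SmolVLM2 model cards; it assumes a recent transformers release with SmolVLM2 support, and `demo.mp4` is a hypothetical local file:

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_path = "HuggingFaceTB/SmolVLM2-500M-Video-Instruct"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForImageTextToText.from_pretrained(
    model_path, torch_dtype=torch.bfloat16
)

# One chat turn mixing a video and a text instruction.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "demo.mp4"},  # hypothetical local file
            {"type": "text", "text": "Describe this video in one sentence."},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)
out = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```

The same call pattern should apply to the 256M checkpoint below, swapping only the model path.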
SmolVLM2-256M-Video-Instruct
Apache-2.0
SmolVLM2-256M-Video-Instruct is a lightweight multimodal model specifically designed for analyzing video content, capable of processing video, image, and text inputs to generate text outputs.
Image-to-Text
Transformers English

HuggingFaceTB · Downloads: 22.16k · Likes: 53
T-lite-it-1.0 Quants GGUF
T-lite-it-1.0 is a large language model supporting Russian and English, distributed here as GGUF-format quantizations.
Large Language Model Supports Multiple Languages
DefaultDF · Downloads: 49 · Likes: 0
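A minimal sketch for running one of these quants locally with llama-cpp-python; the GGUF filename below is hypothetical, so substitute an actual file from the repo:

```python
from llama_cpp import Llama

# Load a local GGUF quant (hypothetical filename; download one from the repo).
llm = Llama(model_path="t-lite-it-1.0-Q4_K_M.gguf", n_ctx=4096)

resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Introduce yourself in one sentence."}],
    max_tokens=128,
)
print(resp["choices"][0]["message"]["content"])
```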
nanoLLaVA-1.5
Apache-2.0
nanoLLaVA-1.5 is a compact yet capable vision-language model with under 1 billion parameters, designed specifically for edge devices.
Image-to-Text
Transformers English

qnguyen3 · Downloads: 442 · Likes: 109
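A hedged loading sketch: nanoLLaVA ships its vision and chat logic as remote code, so the image-preprocessing helpers come from the repo itself, and the full ChatML prompt format (with an `<image>` token) is documented on the model card:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code pulls in the repo's custom vision/chat implementation.
repo = "qnguyen3/nanoLLaVA-1.5"
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, torch_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
```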
Imp-v1.5-4B-Phi3
Apache-2.0
Imp-v1.5-4B-Phi3 is a lightweight, high-performance multimodal model with only 4 billion parameters, built on the Phi-3 language model and a SigLIP visual encoder.
Image-to-Text
Transformers

MILVLG · Downloads: 140 · Likes: 7
moondream2 llamafile
Apache-2.0
moondream2 is a compact vision-language model specifically designed for efficient operation on edge devices, offering convenient deployment through the llamafile format.
Image-to-Text
cjpais · Downloads: 310 · Likes: 30
nanoLLaVA
Apache-2.0
nanoLLaVA is a vision-language model of roughly 1 billion parameters, designed to run efficiently on edge devices.
Image-to-Text
Transformers English

qnguyen3 · Downloads: 2,851 · Likes: 154
MiniCPM-V
MiniCPM-V is an efficient lightweight multimodal model optimized for edge device deployment, supporting bilingual (Chinese-English) interaction and outperforming models of similar scale.
Image-to-Text
Transformers

openbmb · Downloads: 19.74k · Likes: 173
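A sketch of the remote-code `chat()` interface shown on the MiniCPM-V model card; the exact signature is defined by the repo's custom code, so treat the argument list as an assumption:

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

repo = "openbmb/MiniCPM-V"
model = AutoModel.from_pretrained(repo, trust_remote_code=True,
                                  torch_dtype=torch.bfloat16).eval()
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)

image = Image.open("photo.jpg").convert("RGB")  # hypothetical local image
msgs = [{"role": "user", "content": "What is in this image?"}]

# chat() is a remote-code method; arguments follow the model card's example.
answer, context, _ = model.chat(
    image=image, msgs=msgs, context=None, tokenizer=tokenizer, sampling=True
)
print(answer)
```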
moondream1
A 1.6B-parameter multimodal model combining the SigLIP and Phi-1.5 architectures, supporting image understanding and Q&A tasks.
Image-to-Text
Transformers English

vikhyatk · Downloads: 70.48k · Likes: 487
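A sketch of the `encode_image`/`answer_question` remote-code API used across the moondream model cards; details can differ by revision, so treat this as an assumption:

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "vikhyatk/moondream1"
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)

image = Image.open("photo.jpg")  # hypothetical local image
enc = model.encode_image(image)  # remote-code helper from the repo
print(model.answer_question(enc, "What is happening in this picture?", tokenizer))
```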
tiny-llava-v1-hf
Apache-2.0
TinyLLaVA is a framework of small-scale large multimodal models focused on vision-language tasks, delivering strong performance from a small parameter count.
Image-to-Text
Transformers Supports Multiple Languages

bczhou · Downloads: 2,372 · Likes: 57
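Because this "-hf" checkpoint follows the transformers LLaVA format, a plain image-to-text pipeline should work; the prompt template and image path here are assumptions, so check the model card:

```python
from transformers import pipeline

pipe = pipeline("image-to-text", model="bczhou/tiny-llava-v1-hf")

# LLaVA-style prompt with an <image> placeholder (assumed; see the model card).
prompt = "USER: <image>\nWhat is shown in this picture?\nASSISTANT:"
out = pipe("photo.jpg", prompt=prompt, generate_kwargs={"max_new_tokens": 64})
print(out[0]["generated_text"])
```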
UForm-Gen-Chat
Apache-2.0
UForm-Gen-Chat is the fine-tuned multimodal conversational version of UForm-Gen, primarily used for image caption generation and visual question answering tasks.
Image-to-Text
Transformers English

unum-cloud · Downloads: 65 · Likes: 19
UForm-Gen
Apache-2.0
UForm-Gen is a small generative vision-language model primarily used for image caption generation and visual question answering.
Image-to-Text
Transformers English

unum-cloud · Downloads: 152 · Likes: 44